AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceDec-12-2025

A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

Zhou, Sizhe, Han, Jiawei

LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) -- self-contained statements with normalized entities and source turn attributions -- and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.

edus, large language model, natural language, (14 more...)

2511.17208

Country: North America > United States > Illinois (0.14)

Genre: Research Report > New Finding (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsAug-15-2025, 03:29:14 GMT

Supplementary Materials for " T emporal-attentive Covariance Pooling Networks for Video Recognition " Zilin Gao

They obtain 0.5% and 0.7% gains over those with

dataset, emporal-attentive covariance pooling network, supplementary material, (10 more...)

Country:

Asia > China > Tianjin Province > Tianjin (0.05)
Asia > China > Liaoning Province > Dalian (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.71)

arXiv.org Artificial IntelligenceJul-28-2025

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Lu, Hongyuan, Li, Zixuan, Zhang, Zefan, Lam, Wai

There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called \textbf{A}utomatic \textbf{D}ictionary \textbf{S}election (\textbf{ADS}). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call \textbf{S}elect \textbf{Lo}w-frequency \textbf{W}ords! (\textbf{SLoW}) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline.\footnote{A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.}\footnote{Code and data available upon publication.}

large language model, machine learning, translation, (19 more...)

2507.18902

Country:

Europe (0.93)
Asia > Middle East (0.69)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsMay-27-2025, 13:03:44 GMT

Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification

Graph Transformers (GTs) have recently emerged as popular alternatives to traditional message-passing Graph Neural Networks (GNNs), due to their theoretically superior expressiveness and impressive performance reported on standard node classification benchmarks, often significantly outperforming GNNs. In this paper, we conduct a thorough empirical analysis to reevaluate the performance of three classic GNN models (GCN, GAT, and GraphSAGE) against GTs. Our findings suggest that the previously reported superiority of GTs may have been overstated due to suboptimal hyperparameter configurations in GNNs. Remarkably, with slight hyperparameter tuning, these classic GNN models achieve state-of-the-art performance, matching or even exceeding that of recent GTs across 17 out of the 18 diverse datasets examined. Additionally, we conduct detailed ablation studies to investigate the influence of various GNN configurations--such as normalization, dropout, residual connections, and network depth--on node classification performance.

gnn, node classification, reassessing gnn, (3 more...)

Genre: Research Report > New Finding (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceMar-21-2025

Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID

Chen, Yu-Hsi

Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the YOLOv5 with the DeepSORT pipeline, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the metrics from the 4th Anti-UAV Challenge and demonstrate competitive performance. Notably, we achieve strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a "Strong Baseline" for the multi-UAV tracking task. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at https://github.com/wish44165/YOLOv12-BoT-SORT-ReID .

artificial intelligence, track 1, track 2, (17 more...)

2503.17237

Country:

Oceania > Australia (0.04)
North America > United States (0.04)
North America > Canada (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry:

Information Technology (0.48)
Aerospace & Defense > Aircraft (0.34)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)

arXiv.org Artificial IntelligenceFeb-18-2025

MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos

Yuan, Huaying, Ni, Jian, Wang, Yueze, Zhou, Junjie, Liang, Zhengyang, Liu, Zheng, Cao, Zhao, Dou, Zhicheng, Wen, Ji-Rong

Retrieval augmented generation (RAG) holds great promise in addressing challenges associated with long video understanding. These methods retrieve useful moments from long videos for their presented tasks, thereby enabling multimodal large language models (MLLMs) to generate high-quality answers in a cost-effective way. In this work, we present MomentSeeker, a comprehensive benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval (LVMR) tasks. MomentSeeker offers three key advantages. First, it incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. Second, it covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios (e.g., sports, movies, cartoons, and ego), making it a comprehensive tool for assessing retrieval models' general LVMR performance. Additionally, the evaluation tasks are carefully curated through human annotation, ensuring the reliability of assessment. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark. We perform extensive experiments with various popular multimodal retrievers based on our benchmark, whose results highlight the challenges of LVMR and limitations for existing methods. Our created resources will be shared with community to advance future research in this field.

artificial intelligence, large language model, natural language, (4 more...)

2502.12558

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)

Neural Information Processing SystemsJan-24-2025, 09:26:12 GMT

Reviews: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Quality: This paper suffers from a few critical issues. Clarity: The experiment setting ups can be described with more details. Sec 3.2 and 3.4 is missing important information such as the datasets used for conducting the experiments. Significance: Although the quality of the proposed model remains unclear because of the previously mentioned critical issues, it's a significant work because it's the first GAN-based model for spectrogram-to-waveform conversion which seems to be working at some degree. It's significantly over-claimed: 1) claiming state-of-the-art for spectrogram-to-waveform conversion (line 6) with MOS 3.09 is surprising; many previous works are at a much higher level (e.g.

conditional waveform synthesis, generative adversarial network, spectrogram-to-waveform conversion, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

arXiv.org Artificial IntelligenceOct-17-2024

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Yang, Ke, Liu, Yao, Chaudhary, Sapana, Fakoor, Rasool, Chaudhari, Pratik, Karypis, George, Rangwala, Huzefa

Autonomy via agents using large language models (LLMs) for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof-of-concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi-agent systems, search methods, etc.) and the corresponding in-context examples, which may not generalize well across all real-world scenarios. On the other hand, there has been limited study on the misalignment between a web agent's observation/action representation and the pre-training data of the LLM it's based on. This discrepancy is especially notable when LLMs are primarily trained for language completion rather than tasks involving embodied navigation actions and symbolic web elements. Our study enhances an LLM-based web agent by simply refining its observation and action space to better align with the LLM's capabilities. This approach enables our base agent to significantly outperform previous methods on a wide variety of web tasks. Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. We achieve this without using in-context examples, new agent roles, online feedback or search strategies. AgentOccam's simple design highlights LLMs' impressive zero-shot performance on web tasks, and underlines the critical role of carefully tuning observation and action spaces for LLM-based agents.

large language model, machine learning, natural language, (18 more...)